Shortest distance between strings - Levenstein algorithm¶

Top horizontal row is always   1,2,... -- cost of insertions [j]
Left vertical column is always 1,2,... -- cost of deletions [i]
begin at upper left (<= 0)
to fill in a cell:    diag     above
                      left     min (above + delete, diag + replace, left + insert)

See: http://www.let.rug.nl/kleiweg/lev/
     https://people.cs.umass.edu/~mccallum/courses/cl2006/lect4-stredit.pdf

Given strings S and T.
Distance is shortest sequence of edit commands that transform S to T,
(or equivalently T to S).
Simple set of operations:
 Copy character from S over to T:       cost = 0
 Delete a character in S:               cost = 1
 Insert a character in T:               cost = 1
 Substitute one character for another:  cost = 1

There are lots of applications of Levenshtein distance. It is used in biology
to find similar sequences of nucleic acids in DNA or amino acids in proteins.
It is used in some spell checkers to guess at which word (from a dictionary)
is meant when an unknown word is encountered.
Wilbert Heeringa's dialectology project uses Levenshtein distance
to estimate the proximity of dialect pronunciations.
And some translation assistance projects have used the alignment
capability of the algorithm in order to discover (the approximate location of)
good translation equivalents.

Initial F:
  F = [ [i + j if i * j == 0 else 0 for j in range(len(B) + 1)] for i in range(len(A) + 1)]
  +-----------------------> j
  |           b  e  n  i  n
  |     [0,   1, 2, 3, 4, 5],
  |  b  [1,   0, 0, 0, 0, 0],
  |  i  [2,   0, 0, 0, 0, 0],
  |  g  [3,   0, 0, 0, 0, 0],
  |  b  [4,   0, 0, 0, 0, 0],
  |  e  [5,   0, 0, 0, 0, 0],
  |  n  [6,   0, 0, 0, 0, 0]
  v
 i

def levenstein(A, B):
   ''' Editor distance
       indel =1       - the cost for inserting or deleting a symbol
       substitution=1 - the cost for substituting one symbol for another
       swap           - the cost for swapping two adjacent symbols.
   '''
   F = [ [i + j if i * j == 0 else 0 for j in range(len(B) + 1)] for i in range(len(A) + 1)]
   for i in range(1, len(A) + 1):            # +1 - "zero" elements in line
       for j in range(1, len(B) + 1):        # +1 - "zero" elements in column
           if A[i-1] == B[j-1]:              # -1 - real indexes
               F[i][j] = F[i-1][j-1]         # == previous diag distance
           else:
               # Adding 1 because of error
               # and min (shortest distance) distance of possible previous iteratins
               F[i][j] = 1 + min(F[i][j-1], F[i-1][j], F[i-1][j-1])
   return F[len(A)][len(B)]

Test¶

A = list("bigben")
B = list("bigben")
[[0,    1, 2, 3, 4, 5, 6],
      +-----------------> j
 [1,    0, 1, 2, 3, 4, 5],
 [2,    1, 0, 1, 2, 3, 4],
 [3,    2, 1, 0, 1, 2, 3],
 [4,    3, 2, 1, 0, 1, 2],
 [5,    4, 3, 2, 1, 0, 1],
 [6,    5, 4, 3, 2, 1, 0]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist))             # Minimal distance = 0

A = list("b2gben")
B = list("b1gben")
 [[0,    1, 2, 3, 4, 5, 6],
       +-----------------> j
 [1,     0, 1, 2, 3, 4, 5],
 [2,     1, 1, 2, 3, 4, 5],       # 2, 2
 [3,     2, 2, 1, 2, 3, 4],
 [4,     3, 3, 2, 1, 2, 3],
 [5,     4, 4, 3, 2, 1, 2],
 [6,     5, 5, 4, 3, 2, 1]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist))             # Minimal distance = 1 (wrong character)

A = list("bigben")
B = list("bgben")
 [[0,    1, 2, 3, 4, 5],
       +-----------------> j
 [1,     0, 1, 2, 3, 4],
 [2,     1, 1, 2, 3, 4],
 [3,     2, 1, 2, 3, 4],
 [4,     3, 2, 1, 2, 3],
 [5,     4, 3, 2, 1, 2],
 [6,     5, 4, 3, 2, 1]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist))             # Minimal distance = 1 (missed character)

A = list("bigben")
B = list("benin")
 b     i g b e n
 b e n i       n
         b  e  n  i  n
 [ [0,   1, 2, 3, 4, 5],
       +-----------------> j
 b [1, | 0, 1, 2, 3, 4], b = b   => [0, 1, 2, 3, 4]
 i [2, | 1, 1, 2, 2, 3], e != i  => 1 + 0
 g [3, | 2, 2, 2, 3, 3],
 b [4, | 3, 3, 3, 3, 4],
 e [5, | 4, 3, 4, 4, 4],
 n [6, | 5, 4, 3, 4, 4]]
min_dist = levenstein(A, B)
print("Minimal distance = {}".format(min_dist))             # Minimal distance = 4

# TBD
# Recover the path from F(N, M) to F(0, 0)